Chunk Tagger 

Statistical Recognition of Noun Phrases 



In ESSLLI- 



Wojciech Skut and Thorsten Brants 

Universität des Saarlandes
Computational Linguistics
D-66041 Saarbrücken, Germany
{skut,brants}@coli.uni-sb.de

Workshop on Automated Acquisition of Syntax and Parsing, Saarbrücken, 1998



Abstract 



We describe a stochastic approach to 
partial parsing, i.e., the recognition of 
syntactic structures of limited depth. 
The technique utilises Markov Models, 
but goes beyond usual bracketing ap- 
proaches, since it is capable of recog- 
nising not only the boundaries, but 
also the internal structure and syn- 
tactic category of simple as well as 
complex NP's, PP's, AP's and adver- 
bials. We compare tagging accuracy 
for different applications and encoding 
schemes. 

1 Motivation 

The term chunking (also partial or shallow parsing)
refers to techniques used for the recognition of
relatively simple syntactic structures, such as
NPs, PPs, verb complexes etc.

NP chunkers typically rely on fairly simple 
and efficient processing tools such as finite au- 
tomata or (in stochastic approaches) Markov 
Models (MMs). The output consists of struc- 
tures recognised with a high degree of certainty; 
such structures are used for further processing. 

1.1 Annotation Software 

The main motivation for the work reported in 
this paper was the development of NLP soft- 
ware for creating language resources, especially 
syntactically annotated corpora (treebanks). A 
disadvantage of symbolic tools supporting corpus
annotation is that they are language-specific and
have to be developed prior to actual annotation.
For English, this is not a problem since there are
many such tools, yet for other languages, serious
difficulties may arise here.

As an alternative, a bootstrapping approach 
can be taken in which, after a short phase of 
purely manual annotation, more and more au- 
tomatic procedures are implemented using sta- 
tistical NLP methods. The already annotated 
sentences serve as training data. This approach 
is highly independent of the annotation format, 
which is simply learned from training data. 

With these prerequisites, we have developed a 
stochastic parser (chunker) that recognises syn- 
tactic structures of limited depth. The tool 
is language-independent and can be used for 
parsing unrestricted text provided some mini- 
mal amount of annotated data is available. 

1.2 Overview 

In the following, we describe our stochastic ap- 
proach to NP chunking based on a generalisa- 
tion of standard POS tagging techniques (hence 
the name chunk tagger). First, we show how a 
simple bracketing method can be extended to 
recognise more complex structures and several 
types of phrases (sections 2 and 3). Accuracy
for different applications and tasks is reported
in section 4. In section 5, we discuss the
similarities and differences between our approach and
related research. 

2 Stochastic NP Recognition 

The idea of using statistics for NP chunking
goes back to Church (1988), who used corpus
frequencies to determine the boundaries of simple
non-recursive NP's. For each pair of POS tags
$t_i$, $t_j$, the probability of an NP boundary
('[' or ']') occurring between $t_i$ and $t_j$ is
computed. On the basis of these context probabilities,
the program inserts the symbols '[' and ']'
into sequences of POS tags, yielding output of
the following form:

[A/AT former/AP top/NN aide/NN] to/IN
[Attorney/NP General/NP Edwin/NP
Meese/NP] interceded/VBD to/TO extend/VB
[an/AT aircraft/NN company/NN ...

The accuracy of this approach is impressive.
On the other hand, the task is not too difficult
since recursive structures are not recognised. An
interesting question is whether this simple technique
can be used for the recognition of more complex
phrases.
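
As a rough illustration of the context probabilities described above, the following sketch estimates, for each pair of adjacent POS tags, the relative frequency of the bracket symbols observed between them in bracketed training data. The function name `boundary_probabilities`, the token-list input format and the label `"none"` for "no boundary" are assumptions made for this example. Church's actual program, including its search for the best bracketing, is not reproduced here.

```python
from collections import Counter, defaultdict
from typing import Dict, List, Tuple


def boundary_probabilities(
    sentences: List[List[str]],
) -> Dict[Tuple[str, str], Dict[str, float]]:
    """Estimate P(boundary symbols | left tag, right tag) by relative frequency.

    Each sentence is a list of tokens that are either POS tags or the
    bracket symbols '[' and ']' (hypothetical input format).
    """
    counts: Dict[Tuple[str, str], Counter] = defaultdict(Counter)
    for sentence in sentences:
        prev_tag = None
        between = ""  # bracket symbols seen since the previous POS tag
        for token in sentence:
            if token in ("[", "]"):
                between += token
                continue
            if prev_tag is not None:
                counts[(prev_tag, token)][between or "none"] += 1
            prev_tag, between = token, ""
    return {
        pair: {symbol: n / sum(c.values()) for symbol, n in c.items()}
        for pair, c in counts.items()
    }


# POS tags and brackets of the example sentence quoted above.
example = "[ AT AP NN NN ] IN [ NP NP NP NP ] VBD TO VB [ AT NN NN".split()
probs = boundary_probabilities([example])
print(probs[("NN", "IN")])  # symbols observed between NN and IN
```

With a single training sentence the estimates are of course degenerate; the point is only the shape of the table that is consulted when brackets are inserted into a new tag sequence.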

2.1 Beyond Simple Bracketing 

We have modified Church's approach in a way 
permitting efficient and reliable recognition of 
structures of limited depth, including complex 
prenominal adjectival and participial phrases, 
postnominal PP's and genitives, appositions, 
etc. We encode the structure in such a way that
it can be recognised by a part-of-speech tagger,
so the process runs in time linear in the length
of the input string.

The basic idea is simple enough: structures of 
limited depth are encoded using a finite number 
of flags. We employ flags standing not just for 
bracketing, but for structural relations between 
adjacent words. 

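Given a sequence of words $(w_0, w_1, \ldots, w_n)$, we
consider the structural relation $r_i$ holding between
$w_i$ and $w_{i-1}$ for $1 \le i \le n$. For the recognition
of most (even fairly complex) NPs, PPs, and APs, it is
sufficient to distinguish the following seven values of
$r_i$, which uniquely identify sub-structures of limited
depth.

$$
r_i = \begin{cases}
\texttt{0}  & \text{if } \mathit{parent}(w_i) = \mathit{parent}(w_{i-1}) \\
\texttt{+}  & \text{if } \mathit{parent}(w_i) = \mathit{parent}^2(w_{i-1}) \\
\texttt{++} & \text{if } \mathit{parent}(w_i) = \mathit{parent}^3(w_{i-1}) \\
\texttt{-}  & \text{if } \mathit{parent}^2(w_i) = \mathit{parent}(w_{i-1}) \\
\texttt{--} & \text{if } \mathit{parent}^3(w_i) = \mathit{parent}(w_{i-1}) \\
\texttt{=}  & \text{if } \mathit{parent}^2(w_i) = \mathit{parent}^2(w_{i-1}) \\
\texttt{1}  & \text{else}
\end{cases}
$$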



If more than one of the conditions above is
met, the first of the corresponding tags in the
list is assigned. The depth of structures is limited
to 3. For convenience, we give the graphical
representation of the structural tags in figure 1.
A structure tagged with these symbols is shown
in figure 2.
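
The definition of the structural tags can be sketched in a few lines of Python. The code below is an illustration only: the parent-pointer dictionary, the function names and the decision to give the first word the tag 1 are assumptions made for this example. The tree used is one bracketing consistent with the categories mentioned for the sample phrase of figure 2, namely NP[ein AP[PP[in MPN[Tel Aviv]] lebender] Maler]; it is assumed here, not taken from the annotated corpus.

```python
from typing import Dict, List, Optional


def ancestor(parent: Dict[str, str], node: str, k: int) -> Optional[str]:
    """Return the k-th ancestor of node (k=1 is the parent), or None."""
    current: Optional[str] = node
    for _ in range(k):
        if current is None:
            return None
        current = parent.get(current)
    return current


# (tag, ancestor level for w_i, ancestor level for w_{i-1}), in list order,
# so that the first matching condition wins.
CONDITIONS = [("0", 1, 1), ("+", 1, 2), ("++", 1, 3),
              ("-", 2, 1), ("--", 3, 1), ("=", 2, 2)]


def structural_tags(words: List[str], parent: Dict[str, str]) -> List[str]:
    """Compute r_i for each word; node names must be unique in this sketch."""
    tags = ["1"]  # assumption: the first word has no left neighbour
    for prev, cur in zip(words, words[1:]):
        tag = "1"  # the "else" case
        for name, k_cur, k_prev in CONDITIONS:
            a = ancestor(parent, cur, k_cur)
            b = ancestor(parent, prev, k_prev)
            if a is not None and a == b:
                tag = name
                break
        tags.append(tag)
    return tags


words = ["ein", "in", "Tel", "Aviv", "lebender", "Maler"]
# Assumed tree: NP[ein AP[PP[in MPN[Tel Aviv]] lebender] Maler]
parent = {"ein": "NP", "Maler": "NP", "AP": "NP",
          "lebender": "AP", "PP": "AP",
          "in": "PP", "MPN": "PP",
          "Tel": "MPN", "Aviv": "MPN"}
print(list(zip(words, structural_tags(words, parent))))
```

Under the assumed tree, Aviv receives 0 (same parent as Tel) and lebender receives ++ (its parent AP is the third-level ancestor of Aviv).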




Figure 1: Structural tags $r_2$ assigned to $w_2$

Instead of the simple context frequencies
used by Church, we employ a generalisation of
the standard MM-based POS tagging method.
The task of the chunker is to assign the most
probable sequence of structural tags
$R = (r_0, r_1, \ldots, r_n)$ to a sequence of POS tags
$T = (t_0, t_1, \ldots, t_n)$. This can be done in exactly
the same way as the assignment of the optimal
POS sequence to a sequence of words
in POS tagging, and the task is to calculate

$$
\operatorname*{argmax}_R P(R \mid T) = \operatorname*{argmax}_R P(R) \cdot P(T \mid R) \qquad (1)
$$

Under this perspective, a standard part-of-speech
tagger can be trained on a syntactically
annotated corpus with structures converted into
structural tags (the $r_i$'s). However, in this case
the corresponding Markov Model has only 7
tags (the possible values of $r_i$), which is obviously
too coarse-grained. The precision of the
tagger is rather disappointing: only about 77%
of all structures are recognised correctly.

[Tree diagram for the phrase ein in Tel Aviv lebender Maler,
'a painter living in Tel Aviv', with the POS tag and
structural tag of each word.]

AC = adpositional case marker, HD = head, MO = modifier,
MPN = multi-token proper noun, NK = noun phrase
kernel, PNC = proper noun constituent

Figure 2: Encoding of a sample structure

To cope with this problem, we enrich the MM
state with information about the POS tag $t_i$
assigned to $w_i$. Now we can define structural
tags as pairs $S_i = (r_i, t_i)$. Such tags constitute a
finite alphabet of symbols describing structures
of depth $\le 3$.

The tagger's task is thus to assign the
most probable sequence of structural tags
$S = (S_0, S_1, \ldots, S_n)$ to a sequence of
part-of-speech tags $T = (t_0, t_1, \ldots, t_n)$, hence

$$
\operatorname*{argmax}_S P(S \mid T) = \operatorname*{argmax}_S P(S) \cdot P(T \mid S) \qquad (2)
$$



The part-of-speech tags are encoded in the
structural tag (the $t_i$ dimension), so $S$ uniquely
determines $T$. Therefore, we have $P(t_i \mid S_i) = 1$
if $S_i = (r_i, t_i)$ and $0$ otherwise, which simplifies
calculations.
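
The practical effect of this simplification can be sketched as follows: since the emission probability is either 1 or 0, decoding only has to consider, for each position, the structural tags whose POS component equals the observed POS tag, and the score of a path is determined by the tag-sequence model alone. The sketch below is a generic Viterbi decoder written under simplifying assumptions: it uses a bigram model for brevity (the tagger described here uses trigram contexts), the function name `viterbi`, the dictionaries `init` and `trans` and the probability floor are hypothetical, and the toy probabilities are invented for illustration. It also assumes that every observed POS tag has at least one compatible structural tag.

```python
import math
from typing import Dict, List, Tuple

Tag = Tuple[str, str]  # (structural relation r, POS tag t)


def viterbi(pos_tags: List[str],
            tagset: List[Tag],
            init: Dict[Tag, float],
            trans: Dict[Tuple[Tag, Tag], float],
            floor: float = 1e-12) -> List[Tag]:
    """Most probable structural tag sequence for a POS tag sequence."""
    if not pos_tags:
        return []
    # P(t_i | S_i) is 1 or 0: keep only tags whose POS part matches t_i.
    layers = [[s for s in tagset if s[1] == t] for t in pos_tags]
    scores = {s: math.log(max(init.get(s, 0.0), floor)) for s in layers[0]}
    backpointers = []
    for layer in layers[1:]:
        new_scores, pointers = {}, {}
        for s in layer:
            def through(p: Tag, s: Tag = s) -> float:
                # log-probability of the best path ending in p, extended by s
                return scores[p] + math.log(max(trans.get((p, s), 0.0), floor))
            best_prev = max(scores, key=through)
            new_scores[s] = through(best_prev)
            pointers[s] = best_prev
        scores = new_scores
        backpointers.append(pointers)
    # Follow the backpointers from the best final tag.
    path = [max(scores, key=scores.get)]
    for pointers in reversed(backpointers):
        path.append(pointers[path[-1]])
    path.reverse()
    return path


# Toy model (invented numbers, not estimated from any corpus).
tagset = [("1", "ART"), ("0", "ADJA"), ("+", "ADJA"), ("0", "NN"), ("+", "NN")]
init = {("1", "ART"): 1.0}
trans = {(("1", "ART"), ("0", "ADJA")): 0.6,
         (("1", "ART"), ("+", "ADJA")): 0.4,
         (("0", "ADJA"), ("0", "NN")): 0.5,
         (("0", "ADJA"), ("+", "NN")): 0.5,
         (("+", "ADJA"), ("+", "NN")): 0.9}
print(viterbi(["ART", "ADJA", "NN"], tagset, init, trans))
```

In the toy model the locally preferred tag for ADJA (probability 0.6) is not on the best path, because the alternative leads to a more probable continuation (0.4 x 0.9 against 0.6 x 0.5); this is the kind of decision that dynamic programming over whole sequences resolves.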

The contexts are smoothed by linear inter- 
polation of unigrams, bigrams, and trigrams. 
Their weights are calculated by deleted inter- 
polation. 
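
The smoothing step can be illustrated with a small sketch. The text does not spell out which variant of deleted interpolation was used, so the code below follows one common formulation, in which each trigram votes, with its frequency, for the n-gram order whose relative frequency is highest once that trigram occurrence is removed from the counts. The class name `InterpolatedTrigramModel` and its methods are assumptions made for this example.

```python
from collections import Counter
from typing import Sequence, Tuple


def ratio(num: float, den: float) -> float:
    """Safe division: an undefined relative frequency counts as 0."""
    return num / den if den > 0 else 0.0


class InterpolatedTrigramModel:
    """P(c | a, b) = l1 * P(c) + l2 * P(c | b) + l3 * P(c | a, b)."""

    def __init__(self, tags: Sequence[str]) -> None:
        self.n = len(tags)
        self.uni = Counter(tags)
        self.bi = Counter(zip(tags, tags[1:]))
        self.tri = Counter(zip(tags, tags[1:], tags[2:]))
        self.lambdas = self._deleted_interpolation()

    def _deleted_interpolation(self) -> Tuple[float, float, float]:
        lam = [0.0, 0.0, 0.0]
        for (a, b, c), f in self.tri.items():
            # Relative frequencies with the current trigram "deleted".
            cases = [ratio(self.uni[c] - 1, self.n - 1),
                     ratio(self.bi[(b, c)] - 1, self.uni[b] - 1),
                     ratio(f - 1, self.bi[(a, b)] - 1)]
            lam[cases.index(max(cases))] += f
        total = sum(lam)
        if total == 0:  # no trigrams at all: fall back to unigrams
            return (1.0, 0.0, 0.0)
        return (lam[0] / total, lam[1] / total, lam[2] / total)

    def prob(self, a: str, b: str, c: str) -> float:
        """Smoothed probability of tag c after the context (a, b)."""
        l1, l2, l3 = self.lambdas
        return (l1 * ratio(self.uni[c], self.n)
                + l2 * ratio(self.bi[(b, c)], self.uni[b])
                + l3 * ratio(self.tri[(a, b, c)], self.bi[(a, b)]))


model = InterpolatedTrigramModel("a b c a b c a b d".split())
print(model.lambdas, model.prob("a", "b", "c"))
```

In the chunk tagger the symbols would be the composite structural tags rather than single letters; the letters here only keep the example small.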

3 Phrasal Categories 

A simple extension of the chunk tagger can
assign phrasal categories in addition to structures.
We enrich the state $S_i$ of the Markov Model
with information about the category $c_i$ of the
node immediately dominating word $w_i$. Thus
$S_i$ becomes a triple $(r_i, t_i, c_i)$. For example, the
adjective lebender in figure 2 is assigned the tag
(++, ADJA, AP). This extension also slightly
improves the recognition of structures, cf.
section 4.

Further precision gain can be achieved if we
also add some information about the category
of the grandparent node. However, only a few
symbols can be used to encode this dimension.
Otherwise, the tagset (all $S_i = (r_i, t_i, c_i, g_i)$)
becomes too large. We achieved the best results
with just three flags A, N and C, which indicate
that $\mathit{parent}^2(w_i)$ is an AP, an NP/PP and
a coordinated constituent, respectively. In this
format, the word Aviv in figure 2 is assigned the
tag (0, NE, MPN, N).

4 Applications and Results 

In this section, we compare results achieved for 
different applications and types of structures. 
We use the dependency-oriented NEGRA treebank
(Skut et al., 1997) as training data. The
current size of the corpus is 12,000 sentences, 
or 210,000 tokens. All of these sentences have 
been annotated without the help of the chunk 
tagger. 

The annotation scheme distinguishes 24
phrasal categories. The POS tagset (Thielen
and Schiller, 1995) consists of 54 tags. For tagging
purposes, it has been adjusted by merging
tags irrelevant to the chunking task and adding
simple morphological and lexical information.
Due to this adjustment, 1.5% more words are
assigned the correct structural tag.

Structures are encoded according to the
method presented in section 2. We vary the
number of tag dimensions (1 to 4).

The results given in the following sections
have been computed by splitting the corpus into
disjoint training and test parts (90% and 10%,
respectively). This procedure was repeated ten
times, and the results were averaged. The accuracy
measures employed are the following.

tags: the percentage of structural tags with the
correct value of the $r_i$ attribute,

bracketing: the percentage of correctly recognised
nodes,

labelled bracketing: the percentage of nodes
recognised correctly, including their syntactic
category,

top-level chunks: the percentage of correctly
parsed "maximal" chunks, i.e., phrases not
contained in a larger chunk of depth $\le 3$.

We present figures concerning the precision 
of the chunker. Recall is slightly lower for all 
applications (0.5% - 1.5%). 
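
To make the bracketing measures concrete, the following sketch computes precision and recall over sets of phrasal nodes. It is an illustration under assumptions: nodes are represented as (start, end, category) triples, the function name `precision_recall` is hypothetical, and the example nodes are invented. Because nodes are kept in sets, two nodes with the same span and category would count only once.

```python
from typing import Set, Tuple

Node = Tuple[int, int, str]  # (first token, last token + 1, category)


def precision_recall(predicted: Set[Node], gold: Set[Node],
                     labelled: bool = True) -> Tuple[float, float]:
    """Precision and recall of recognised nodes.

    With labelled=False only the spans are compared (bracketing);
    with labelled=True the category must match too (labelled bracketing).
    """
    if labelled:
        pred_items, gold_items = set(predicted), set(gold)
    else:
        pred_items = {(start, end) for start, end, _ in predicted}
        gold_items = {(start, end) for start, end, _ in gold}
    correct = len(pred_items & gold_items)
    precision = correct / len(pred_items) if pred_items else 0.0
    recall = correct / len(gold_items) if gold_items else 0.0
    return precision, recall


# Invented example: the tagger finds three nodes, one with a wrong label,
# and misses one gold node entirely.
gold = {(0, 6, "NP"), (1, 5, "AP"), (1, 4, "PP"), (2, 4, "MPN")}
predicted = {(0, 6, "NP"), (1, 5, "AP"), (1, 4, "NP")}
print(precision_recall(predicted, gold, labelled=False))
print(precision_recall(predicted, gold, labelled=True))
```

In this example, unlabelled precision is 3/3 and recall 3/4, while labelled precision drops to 2/3 and recall to 2/4, which mirrors the gap between the bracketing and labelled bracketing figures.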

4.1 Corpus Annotation 

As we already mentioned, the primary applica- 
tion of the chunk tagger is corpus (treebank) 
annotation. The treebank is being created in 
an interactive annotation mode: automatic and 
manual annotation steps are closely interleaved 
to ensure optimal control of the predictions 
made automatically (for a precise description
of this interactive approach to treebank annotation
see (Skut et al., 1997)).

As for the chunker, the interactive annotation
mode means that the annotator specifies
the boundaries of a complex NP or PP, and the
tool recognises its category and internal structure.
Note that the disambiguation of PP attachment
is significantly easier than in the general
case. Correct structural tags are assigned to
more than 95% of all words, which means that
approx. 89% of all chunks (NP's, PP's, AP's)
are assigned the correct syntactic structure.

Precise results for different chunk encoding
methods are given in table 1. The training corpus
was created by extracting all NPs, PPs and
APs occurring in the NEGRA treebank (34,000 
chunks, 130,000 tokens). In other words, the 
program had to perform the annotator's task: 
find a labelled structure that spans a given se- 
quence of words. 

Table 1: Precision of the chunk tagger in the 
interactive annotation mode for different chunk 
encoding methods. The symbols in brackets de- 
note: r structural relation (7 values), t POS tag 
(54 values), c parent node category (24 values), 
g grandparent node category (3 values). 



dimensions      tags ($r_i$)   brack.   l. brack.
(r)             87.8%          76.6%
(r, c, g)       92.4%          83.9%    78.1%
(r, t)          94.7%          88.3%
(r, t, c)       94.9%          88.7%    84.7%
(r, t, c, g)    95.1%          89.2%    85.2%



It can be seen from the table that part-of- 
speech information (t) is much more relevant 
than information about phrasal categories (c 
and g). The latter also enhances the per- 
formance of the tagger, but the improvement 
achieved is rather modest. 

The tagset size and average ambiguity for the
five encoding schemes are shown in table 2.

Table 2: Tagset sizes and ambiguity. 



dimensions      # tags   tags per word
(r)             7        4.5
(r, c, g)       125      24.9
(r, t)          251      4.5
(r, t, c)       775      18.7
(r, t, c, g)    996      24.9



With a unigram model, i.e. choosing the
most probable tag without looking at the context,
tag assignment precision is only 45.8%
(for $S = (r, t, c, g)$). The precision of a bigram
model is 94.3%. Thus the difference from the
trigram model is very small, which obviously
results from the fairly large amount of information
encoded in a single chunk tag (structural relation,
POS tag and phrasal category), so that
a trigram context does not contain much more
information than a bigram one.

4.2 Tagging the Penn Treebank 

In order to better evaluate the performance of 
the chunk tagger, we applied it to a fragment 
of the Penn Treebank. As in the evaluation 
reported in the previous section, the training 
corpus consisted of all NP's, PP's and AP's oc- 
curring in the Treebank fragment. In the table 
below, the results are contrasted with those of 
chunk tagging the NEGRA corpus. 



Table 3: Precision for different corpora in the in- 
teractive annotation mode (Penn Treebank and 
NEGRA Corpus). Information about external 
phrase boundaries is supplied by the annotator. 



corpus                 PTB       NEGRA
# sentences            10,000    12,000
# top-level chunks     33,808    33,787
# phrasal nodes        88,083    56,110
tags ($r_i$)           93.0%     95.1%
bracketing             91.1%     89.2%
lab. bracketing        86.7%     85.2%
top-level chunks       81.3%     88.8%



The figures show that the top-level chunk 
recognition rate is significantly better for the 
NEGRA corpus data. The difference seems to 
arise from the higher structural complexity of 
the Penn Treebank fragment, where a chunk on 
average contains 2.56 phrasal nodes (as opposed 
to 1.65 in the NEGRA corpus, which does not 
contain unary projections). 



4.3 Other Applications 

The chunk tagger can also be used as a stand- 
alone application, i.e., for the recognition of sim- 
ple structures in text. This task is obviously 
more difficult since all phrase boundaries have 
to be recognised by the tagger. As a result, precision
drops significantly, cf. table 4.



Table 4: Precision of the chunk tagger with 
PP/NP/adverb attachment. No pre-editing by 
a human annotator. 



measure                   correct
structural tags ($r_i$)   90.9%
bracketing                75.4%
labelled bracketing       72.6%
top-level chunks          71.1%



To find the main sources of errors, we exam- 
ined the results and found that erroneous out- 
put mostly originated from wrong PP attach- 
ment. Furthermore, many errors were due to 
coordination and focus adverbs (e.g., nur 'only', 
auch 'also', etc.), which introduce a high ambi- 
guity potential. 

Since the disambiguation of such attachments
involves lexical and even world knowledge, PP
and focus adverb attachment cannot be recognised
in a satisfactory way by an MM-based tagger
operating on POS tags. Thus the best
strategy is to postpone attaching PP's and adverbs,
and make the chunk tagger recognise the
prenominal part of NP's and PP's only. With
this modification, precision increases to more
than 95%. Exact results are given in table 5.

If we ignore errors concerning the internal 
structure of the chunks (i.e., we measure only 
the recognition of external boundaries, which 
corresponds to the precision measurement in 
some other approaches), 93.4% of all chunks are 
assigned the correct part of the input string. 

4.4 Size of the Training Corpus 

An important advantage of the chunker is that
it is independent of theory-internal representations
and can be used to recognise structures of
any language. Of course, the availability of a
training corpus is a prerequisite. Now we shall
see how much data is necessary to achieve reliable
results.

Table 5: Precision of the chunk tagger without
PP/NP/adverb attachment. No pre-editing by
a human annotator.

measure                   correct
structural tags ($r_i$)   95.5%
bracketing                89.3%
labelled bracketing       86.2%
top-level chunks          89.0%

Figure 3: Precision as percentage of correctly recognised top-level chunks of depth 2 and 3, shown
for different numbers of training sentences.

Figure 3 shows precision (measured as the
percentage of top-level chunks recognised correctly)
for the interactive annotation mode. We
consider two encoding schemes. The depth 3
scheme is the one described in section 2, which
uses all 7 possible values of the $r_i$ dimension.
The depth 2 scheme is a slightly simplified version
in which $r_i$ can take only four values: 1, 0,
+, -, so that only depth-two trees are recognised
by the chunker.

While for the depth 3 encoding a training corpus
of 1000-2000 sentences is needed, the simpler
encoding requires only about 500 sentences.
Thus the chunk tagger can be successfully used
in treebank annotation at quite an early stage,
with only a few hundred annotated sentences
available.

5 Related Work 

In section 2, we sketched the simple bracketing
technique described by Church (1988),
which provided motivation for our chunking
method. As far as other approaches are concerned,
our work is most closely related to that
of Joshi and Srinivas (1994), who use Markov
Models in a preprocessing step to reduce the
number of tree segments (called supertags) that
can be assigned to a word in a lexicalised Tree
Adjoining Grammar. This approach makes
parsing more efficient, but it needs a large training
corpus, has to fight a large amount of ambiguity
and needs a subsequent parsing step (also
see (Srinivas, 1996) for the use of explanation-based
learning for this purpose).

Symbolic NP chunkers usually rely on
finite automata and/or pattern matching,
cf. (Koskenniemi, 1990), (Aït-Mokhtar and
Chanod, 1997). Abney (1996) presents a partial
parsing technique based on cascaded finite automata.
Voutilainen and Padró (1997) describe
a POS tagger and shallow parser combining
symbolic and stochastic processing via relaxation
labelling.

The precision of the abovementioned approaches
is often measured by the number of
correct labels assigned to words. The figures
range from 85% to 98%. Our results (89%
to 95%) fit into this interval, yet it should
be kept in mind that the coverage of the approaches
and the precision measuring methods
are often only roughly comparable. For instance,
several shallow parsing methods are restricted
to POS tagging and grammatical function
labelling without explicitly specifying attachments
and phrase boundaries. Furthermore,
the notion of 'phrase' varies in these investigations,
and usually these studies concentrate
on simple, non-recursive structures. By
contrast, our chunker is capable of recognising
complex, even recursive, NPs, PPs, and APs.

Compared to the symbolic techniques, an im- 
portant advantage of the stochastic approach 
taken in our project is its independence of ex- 
ternal lexical resources. As a result, the chunker 
trained with the POS-tags and structures of the 
current corpus is fairly domain-independent. Of 
course, our tool would benefit from the use of 
lexical knowledge; this issue has to be addressed 
in the near future. 

Since our approach is restricted to a small
number of structurally different tags, it has
the great advantage of requiring only a small
amount of training data (cf. section 4.4), and
these phrases are recognised with high accuracy.

6 Conclusion 

We have presented a stochastic partial parser
(chunker) that recognises the boundaries, internal
structure and syntactic category of simple
as well as fairly complex NP's, PP's and AP's.
The chunker is a straightforward application of
a stochastic part-of-speech tagger. We use it
to model a mapping from lexical categories to
syntactic structures. The type of the structural
encoding is crucial in this approach, and better
encodings increase the accuracy of structural
assignment. The use of Markov Model processing
techniques guarantees that the process runs in
time linear in the length of the input string.